Combining Program and Data Specialization
Sandrine Chirokoff, Charles Consel, and Renaud Marlet
Abstract
Program and data specialization have always been studied separately, although they are both aimed at processing early computations. On the one hand, program specialization encodes the result of early computations into a new program; on the other hand, data specialization encodes the result of early computations in data structures. In this paper, we present an extension of the Tempo specializer which performs both program and data specialization. We show how these two strategies can be integrated in a single specializer. This new kind of specializer provides the programmer with complementary strategies which widen the scope of specialization. We illustrate the benefits and limitations of these strategies and their combination on a variety of programs.

1. Introduction

Program and data specialization are both aimed at performing computations which depend on early values. However, they differ in the way the results of early computations are encoded: on the one hand, program specialization encodes these results in a residual program; on the other hand, data specialization encodes these results in data structures.

More precisely, program specialization performs a computation when it only relies on early data, and inserts the textual representation of its result in the residual program when it is surrounded by computations depending on late values. In essence, it is because a new program is being constructed that early computations can be encoded in it. Furthermore, because a new program is being constructed, it can be pruned: that is, the residual program only corresponds to the control flow which could not be resolved given the available data. As a consequence, program specialization optimizes the control flow, since fewer control decisions need to be taken. However, because it requires a new program to be constructed, program specialization can lead to code explosion if the size of the specialization values is large. For example, this situation can occur when a loop needs to be unrolled and the number of iterations is high. Not only does code explosion cause code size problems, but it also degrades the execution time of the specialized program dramatically because of instruction cache misses.

The dual notion to specializing programs is specializing data. This strategy consists of splitting the execution of a program into two phases. The first phase, called the loader, performs the early computations and stores their results in a data structure called a cache. Instead of generating a program which contains the textual representation of values, data specialization generates a program to perform the second phase: it only consists of the late computations and is parameterized with respect to the result of the early computations, that is, the cache. The corresponding program is named the reader. Because the reader is parameterized with respect to the cache, it is shared by all specializations. This strategy fundamentally contrasts with program specialization because it decouples the result of early computations from the program which exploits it. As a consequence, as the size of the specialization problem increases, only the cache parameter increases, not the program. In practice, data specialization can handle problem sizes which are far beyond the reach of program specialization, and thus opens up new opportunities, as demonstrated by Knoblock and Ruf for graphics applications.

However, data specialization, by definition, does not optimize the control flow: it is limited to performing the early computations which are expensive enough to be worth caching. Because the reader
is valid for any cache it is passed, an early control decision leading to a costly early computation needs to be part of the loader as well as of the reader: in the loader, it decides whether the costly computation must be cached; in the reader, it determines whether the cache needs to be looked up. In fact, data specialization does not apply to programs whose bottlenecks are limited to control decisions. A typical example of this situation is interpreters for low-level languages, where the instruction dispatch is the main target of specialization. For such programs, data specialization can be completely ineffective.

Perhaps the apparent difference in the nature of the opportunities addressed by program and data specialization has led researchers to study these strategies in isolation. As a consequence, no attempt has ever been made to integrate both strategies in a specializer; furthermore, no experimental data exist to assess the benefits and limitations of these specialization strategies.

In this paper, we study the relationship between program and data specialization with respect to their underlying concepts, their implementation techniques, and their applicability. More precisely, we study program and data specialization when they are applied separately as well as when they are combined (Section 2). Furthermore, we describe how a specializer can integrate both program and data specialization: what components are common to both strategies and what components differ. In practice, we have achieved this integration by extending a program specializer named Tempo with the phases needed to perform data specialization (Section 3). Finally, we assess the benefits and limitations of program and data specialization based on experimental data collected by specializing a variety of programs exposing various features (Section 4).

2. Concepts of Program and Data Specialization

In this section, the basic concepts of both program and data specialization are presented. The limitations of each strategy are identified and illustrated by an example. Finally, the combination of program and data specialization is introduced.

Program Specialization

The partial evaluation community has mainly been focusing on the specialization of programs: given some inputs of a program, partial evaluation generates a residual program which encodes the result of the early computations that depend on the known inputs. Although program specialization has successfully been used for a variety of applications (e.g., operating systems, scientific programs, and compiler generation), it has shown some limitations. One of the most fundamental limitations is code explosion, which occurs when the size of the specialization problem is large.

Let us illustrate this limitation using the procedure displayed on the left-hand side of Figure 1. In this example, stat is considered static whereas dyn and d are dynamic; static constructs are printed in boldface in the figure. Assuming the specialization process unrolls the loop, the loop index becomes static, and thus the gi procedures (i.e., g1, g2, and g3) can be fully evaluated. Even if the gi procedures correspond to inexpensive computations, program specialization still optimizes procedure f in that it simplifies its control flow: the loop and one of the conditionals are eliminated. A possible specialization of procedure f is presented on the right-hand side of Figure 1. However, beyond some number of iterations, the unrolling of a loop and the computations it enables do not pay for the size of the resulting specialized program; this number depends on the processor features. In fact, as will be shown later, the specialized program can even become slower than the unspecialized program, and the larger the residual loop body, the earlier this phenomenon happens.

[Figure 1: Program specialization — (a) source program; (b) specialized program, where the loop has been unrolled and the static conditional eliminated.]
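The figure's listing can be rendered as the following minimal sketch; the bodies chosen for the gi and Ei, and the bound stat = 3, are illustrative assumptions rather than the paper's actual code.

    /* Illustrative stand-ins for the figure's gi and Ei (assumptions). */
    static int g1(int j) { return j + 1; }
    static int g2(int j) { return 2 * j + 3; }
    static int g3(int j) { return j * j + 1; }
    static int E1(int s) { return s > 2; }     /* static test  */
    static int E2(int d) { return d & 1; }     /* dynamic test */

    /* (a) Source: stat is static; dyn and d are dynamic. */
    void f(int stat, int dyn, int d[])
    {
        int j;
        for (j = 0; j < stat; j++) {
            d[j] = g3(j) * dyn;            /* unconditional early call */
            if (E1(stat)) d[j] += g1(j);   /* static conditional       */
            if (E2(dyn))  d[j] -= g2(j);   /* dynamic conditional      */
        }
    }

    /* (b) A possible residual for stat == 3: the loop is unrolled, E1 and
       all gi calls are evaluated away, and only the dynamic test remains. */
    void f_spec(int dyn, int d[])
    {
        d[0] = 1 * dyn + 1;  if (E2(dyn)) d[0] -= 3;
        d[1] = 2 * dyn + 2;  if (E2(dyn)) d[1] -= 5;
        d[2] = 5 * dyn + 3;  if (E2(dyn)) d[2] -= 7;
    }

Unrolling such a loop a few times is cheap; the code-explosion problem appears when stat grows to thousands of iterations, as discussed above.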
For domains like graphics and scientific computing, some applications are beyond the reach of program specialization because the specialization opportunities rely on very large data or iteration bounds: unrolling the loops traversing these data would cause code explosion. In this situation, data specialization may apply.

Data Specialization

In the late eighties, an alternative to program specialization, called data specialization, was introduced by Barzdins and Bulyonkov, and further explored by Malmkjær. Later, Knoblock and Ruf studied data specialization for a subset of C and applied it to a graphics application.

Data specialization is aimed at encoding the results of early computations in data structures, not in a residual program. The execution of a program is divided into two stages: a loader first executes the early computations and saves their results in a cache; a reader then performs the remaining computations, using the results of the early computations contained in the cache.

Let us illustrate this process with the example displayed in Figure 2. On the left-hand side of this figure, a procedure f is repeatedly invoked in a loop with a first argument c which does not vary and is thus considered early; its second argument, the loop index k, varies at each iteration. Procedure f is also passed a different vector at each iteration, which is assumed to be late. Because this procedure is called repeatedly with the same first argument, data specialization can be used to perform the computations which depend on it. In this context, many computations can be performed, namely the loop test, E1(stat), and the invocations of the gi procedures.

Of course, caching an expression assumes that its execution cost exceeds the cost of a cache reference. Measurements have shown that caching expressions which are too simple (e.g., a variable occurrence or a simple comparison) actually causes the resulting program to slow down. In our example, let us assume that, like the loop test, the expression E1(stat) is not expensive enough to be cached. If, however, the gi procedures are assumed to consist of expensive computations, their invocations need to be examined as potential candidates for caching. Since the first conditional test E1(stat) is early, it can be put in the loader so that, whenever it evaluates to true, the invocation of procedure g1 is cached; similarly, in the reader, the cache is looked up only if the conditional test evaluates to true. However, the invocation of procedure g2 cannot be cached according to Knoblock and Ruf's strategy: since it is under dynamic control, caching its result would amount to performing speculative evaluation. Finally, the invocation of procedure g3 needs to be cached, since it is unconditionally executed and its argument is early. The resulting loader and reader for procedure f, as well as their invocations, are presented on the right-hand side of Figure 2.

[Figure 2: Data specialization — (a) source program, where f(c, k, w[k]) is called in a loop; (b) specialized program: the loader f_load fills a cache of struct data entries, and the reader f_read consumes it.]
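Following the structure visible in Figure 2, a minimal sketch of the loader and reader could look as follows, reusing the gi and Ei stubs of the previous sketch; the field names and the constant MAX are assumptions.

    struct data { int val1, val3; };   /* one cache entry per iteration of f */

    /* Loader: early computations only. g1 is cached under the early test E1;
       g3 is cached unconditionally; g2 is not cached, because it is under
       dynamic control and caching it would be speculative evaluation. */
    void f_load(int stat, struct data cache[])
    {
        int j;
        for (j = 0; j < stat; j++) {
            if (E1(stat))
                cache[j].val1 = g1(j);
            cache[j].val3 = g3(j);
        }
    }

    /* Reader: late computations, parameterized by the cache. */
    void f_read(int stat, int dyn, int d[], const struct data cache[])
    {
        int j;
        for (j = 0; j < stat; j++) {
            d[j] = cache[j].val3 * dyn;   /* lookup replaces the call to g3 */
            if (E1(stat))
                d[j] += cache[j].val1;    /* guarded lookup for g1          */
            if (E2(dyn))
                d[j] -= g2(j);            /* g2 stays in the reader         */
        }
    }

    /* Invocation pattern from the figure: the loader runs once for the
       invariant argument c; the reader is called repeatedly:
           f_load(c, cache);
           for (k = 0; k < MAX; k++)
               f_read(c, k, w[k], cache);                                   */

Note that the reader re-evaluates the cheap test E1(stat) rather than caching it, in line with the cost considerations above.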
To study the limitations of data specialization, consider a program whose computations to be cached are not expensive enough to amortize the cost of a memory reference. In our example, assume the gi procedures correspond to such computations. Then only the control flow of procedure f remains a target for specialization.

Combining Program and Data Specialization

We have shown the benefits and limitations of both program and data specialization. The main parameters that determine which strategy fits the specialization opportunities are the cost of the early computations and the size of the specialization problem. Obviously, within the same program, or even a procedure, some fragments may require program specialization and others data specialization. As a simple example, consider a procedure which consists of two nested loops. The innermost loop may require few iterations and thus allow program specialization to be applied, whereas the outermost loop may iterate over a vector whose size is very large; this may prevent program specialization from being applied, but not data specialization from exploiting some opportunities.

Concretely, performing both program and data specialization can be done in a simple way. One approach consists of doing data specialization first, and then applying the program specializer to the loader, to the reader, or to both. The idea is that code explosion may not be an issue in one of these components; as a result, program specialization can further optimize the loader or the reader by simplifying its control flow or performing speculative specialization. For example, a reader may consist of a loop whose body is small; this situation may allow the loop to be unrolled without causing the residual program to be too large. Applying a program specializer to both the reader and the loader may be possible if the fragments of the program which may cause code explosion are made dynamic. Alternatively, program specialization can be performed prior to data specialization. This combination requires program specialization to be applied selectively, so that only fragments which do not cause code explosion are specialized; the other fragments offering specialization opportunities can then be processed by data specialization. As is shown in Section 4, in practice, combining program and data specialization achieves better performance than pure data specialization, and prevents the performance gain from dropping as quickly as with pure program specialization as the problem size increases.
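To make the nested-loop scenario concrete, the following sketch shows a reader, in the spirit of the DS+PS strategies evaluated in Section 4, whose inner loop is small enough to be unrolled by program specialization while the outer, cache-manipulating loop is kept dynamic. The function names, the cache layout, and the inner bound K = 3 are our assumptions.

    #define K 3   /* small static bound: candidate for unrolling */

    /* Reader produced by data specialization: the outer loop walks a large
       cache of n * K precomputed coefficients. */
    void reader(int n, double dyn, double out[], const double cache[])
    {
        int i, j;
        for (i = 0; i < n; i++) {       /* large bound: kept as a loop */
            out[i] = 0.0;
            for (j = 0; j < K; j++)     /* small bound: unrollable     */
                out[i] += cache[i * K + j] * dyn;
        }
    }

    /* After program specialization of the reader, only the outer loop
       remains; the residual stays small however large n is. */
    void reader_spec(int n, double dyn, double out[], const double cache[])
    {
        int i;
        for (i = 0; i < n; i++)
            out[i] = cache[i * 3] * dyn
                   + cache[i * 3 + 1] * dyn
                   + cache[i * 3 + 2] * dyn;
    }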
3. Integrating Program and Data Specialization

We now present how Tempo is extended to perform data specialization. To do so, let us briefly describe those of its features which are relevant to data specialization and to the experiments presented in the next section.

Tempo

Tempo is an off-line program specializer for C programs. As such, specialization is preceded by a preprocessing phase, which is aimed at computing information to guide the specialization process. The main analyses of Tempo's preprocessing phase are an alias analysis, a side-effect analysis, a binding-time analysis, and an action analysis. The first two analyses are needed because of the imperative nature of the C language, whereas the binding-time analysis is typical of any off-line specializer. The action analysis is more unusual: it computes the specialization actions, i.e., the program transformations to be performed by the specialization phase. The output of the preprocessing phase is a program annotated with specialization actions. Given some specialization values, this annotated program can be used by the specialization phase to produce a residual program at compile time, as is traditionally done by partial evaluators. In addition, Tempo can specialize a program at run time: Tempo's run-time specializer is based on templates which are efficiently compiled by standard C compilers. Tempo has been successfully used for a variety of applications, ranging from operating systems to scientific programs.

Extending Tempo with Data Specialization

Tempo includes a binding-time analysis which propagates binding times forward and backward. The forward analysis aims at determining the static computations: it propagates binding times from the definitions to the uses of variables. The backward analysis performs the same propagation in the opposite direction: when the uses of a variable are both static and dynamic, its definition is annotated static-dynamic. This annotation indicates that the definition should be evaluated both at specialization time and at run time. This process, introduced by Hornof et al., allows a binding-time analysis to be more accurate; such an analysis is said to be use-sensitive. When a definition is static-dynamic and occurs in a control construct (e.g., while), this control construct becomes static-dynamic as well. The specialized program is the code where constructs and expressions annotated static are evaluated at specialization time, with their results introduced in the residual code, and where constructs and expressions annotated dynamic or static-dynamic are rebuilt in the residual code.

To perform data specialization, an analysis is inserted between the forward analysis and the backward analysis. In essence, this new phase identifies the frontier terms, that is, static terms occurring in a dynamic or static-dynamic context. If the cost of a frontier term is below a given threshold, defined as a parameter of the data specializer, it is forced to dynamic or static-dynamic. Furthermore, because data specialization does not perform speculative evaluation, static computations which are under dynamic control are made dynamic. Once these adjustments are done, the backward phase of the binding-time analysis determines the final binding times of the program. Later in the process, the static computations are included in the loader and the dynamic computations in the reader; the frontier terms are cached. The rest of our data specializer is the same as Knoblock and Ruf's.
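As an illustration of these adjustments, consider the following hypothetical fragment, annotated with the binding times the extended analysis would assign; expensive() stands for an assumed costly early computation, and the addition is assumed to fall below the cost threshold.

    extern double expensive(int, int);   /* assumed costly early computation */

    void example(int stat, int dyn, double d[])
    {
        int j;
        for (j = 0; j < stat; j++) {           /* static loop: executed by both
                                                  the loader and the reader     */
            d[j] = expensive(stat, j) * dyn;   /* frontier term (static in a
                                                  dynamic context): computed by
                                                  the loader, cached, and looked
                                                  up by the reader              */
            d[j] += (stat + 1) * dyn;          /* frontier term below the cost
                                                  threshold: forced dynamic and
                                                  recomputed in the reader      */
            if (dyn > 0)
                d[j] -= expensive(j, 0);       /* static call under dynamic
                                                  control: made dynamic, since
                                                  data specialization performs
                                                  no speculative evaluation     */
        }
    }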
4. Performance Evaluation

In this section, we compare the performance obtained by applying different specialization strategies to a set of programs. This set includes several scientific programs and a system program.

Overview

Machine and compiler. The measurements presented in this paper were obtained on a Sun UltraSPARC workstation running SunOS. Times were measured using the Unix system call getrusage and include both user and system times. Figure 3 displays the speedups and the size increases of compiled code obtained for the different specialization strategies. For each benchmark, we give the program invariant used for specialization and an approximation of its time complexity. The source code is included in the appendices. All the programs were compiled with gcc -O; higher degrees of optimization did not make a difference for the programs used in this experiment.

[Figure 3: Program, data and combined specializations — speedups and code size increases for each benchmark.]

Specialization strategies. We evaluate the performance of five different specialization methods. The speedup is the ratio of the execution time of the original program to that of the specialized program. The size increase is the ratio of the size of the specialized program to that of the original one. The data displayed in Figure 3 correspond to the behavior of the following specialization strategies:

- PS-CT: the program is program specialized at compile time.
- PS-RT: the program is program specialized at run time.
- DS: the program is data specialized.
- DS+PS-CT: the program is data specialized, then program specialized at compile time. The loops which manipulate the cache for data specialization are kept dynamic to avoid code explosion.
- DS+PS-RT: the program is data specialized, then program specialized at run time. As in the previous strategy, the loops which manipulate the cache are kept dynamic to avoid code explosion.

Source programs. We consider a variety of source programs: a one-dimensional fast Fourier transformation (FFT), a Chebyshev approximation, a Romberg integration, a Smirnov integration, a cubic spline interpolation, and a Berkeley packet filter (BPF). Given the specialization strategies available, these programs can be classified as follows.

- Control-flow intensive: a program which mainly exposes control-flow computations, while its data-flow computations are inexpensive. In this case, program specialization can improve performance, whereas data specialization does not, because there are no expensive calculations to cache.
- Data-flow intensive: a program which is only based on expensive data-flow computations. As a result, program specialization at compile time, as well as data specialization, can improve the performance of such a program.
- Control- and data-flow intensive: a program which contains both control-flow computations and expensive data-flow computations. Such a program is a good candidate for program specialization at compile time when applied to small values, and is well suited for data specialization when applied to large values.

We now analyze the performance of the five specialization methods in turn on the benchmark programs.

Results

Data specialization can be executed at compile time or at run time. At run time, the loader is executed before the execution of the specialized program, while at compile time the cache is constructed before compilation; the cache is then used by the specialized program during its execution. For all programs, data specialization yields a greater speedup than program specialization at run time, and the combination of these two strategies does not yield a better result. In the rest of this section, we characterize the different specialization opportunities by analyzing the three categories of programs in turn.

Program Specialization

We first analyze two programs where performance is better with program specialization: the Berkeley packet filter (BPF), which interprets a packet with respect to a filter program, and the cubic spline interpolation, which approximates a function using a third-degree polynomial equation.

Characteristics. For the BPF, the program consists exclusively of conditionals whose tests and branches contain inexpensive expressions. For the cubic spline interpolation, the program consists of small loops whose small bodies can be evaluated in part. Concretely, a program which mainly depends on the control-flow graph, and whose leaves contain few, but partially reducible, calculations, is a good candidate for program specialization. By program specialization, the control-flow graph is reduced and some calculations are eliminated.
Since there is no static calculation expensive enough to be efficiently cached by data specialization, the data-specialized program is mostly the same as the original one. For this kind of program, only program specialization gives significant improvements: it reduces the control-flow graph and it produces a small specialized program.

Applications. The BPF (Appendix F) is specialized with respect to a filter program of size n. It mainly consists of conditionals; its time complexity is linear in the size of the filter program, and it does not contain expensive data computations. As the program does not contain any loop, the size of the specialized program is mostly the same as that of the original one. In Figure 3(F), program specialization, at compile time as well as at run time, yields a good speedup, whereas data specialization only improves performance marginally. The combination of program and data specialization does not improve the performance further.
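To illustrate why the dispatch suits program specialization but defeats data specialization, here is a toy filter interpreter in the spirit of the BPF benchmark; the opcode set and all names are invented for this sketch, and the actual source is in Appendix F.

    enum op { OP_LD, OP_JEQ, OP_RET };            /* minimal, invented opcodes */
    struct insn { enum op op; int k; };

    /* prog (the filter) is early; pkt (the packet) is late. The dispatch
       loop terminates because filter programs only branch forward. */
    int interp(const struct insn prog[], const int pkt[])
    {
        int acc = 0, pc = 0;
        for (;;) {
            switch (prog[pc].op) {                /* dispatch on early data */
            case OP_LD:  acc = pkt[prog[pc].k]; pc++;       break;
            case OP_JEQ: pc += (acc == prog[pc].k) ? 1 : 2; break;
            case OP_RET: return prog[pc].k;
            }
        }
    }

    /* Specializing interp with respect to the fixed filter
           { {OP_LD, 0}, {OP_JEQ, 42}, {OP_RET, 1}, {OP_RET, 0} }
       removes the dispatch entirely, leaving nothing expensive to cache: */
    int interp_spec(const int pkt[])
    {
        return pkt[0] == 42 ? 1 : 0;
    }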
The cubic spline interpolation (Appendix E) is specialized with respect to the number of points n and their x coordinates. It contains three singly nested loops; its time complexity is O(n). In the first two loops, more than half of the computations of each body can be completely evaluated or cached by specialization, including real multiplications and divisions. Nevertheless, there is no expensive calculation to cache, so data specialization does not improve performance significantly. The unrolled loops do not really increase the code size, because of the small complexity of the program and the small loop bodies; as a consequence, for each number of points n, the speedup of each specialization strategy barely changes. In Figure 3(E), program specialization at compile time produces a good speedup, whereas program specialization at run time does not improve performance. Data specialization obtains a minor speedup, because the cached calculations are not expensive.

Program Specialization or Data Specialization

We now analyze two programs where performance is similar with program specialization and with data specialization: the Chebyshev polynomial approximation, which approximates a continuous function on a known interval, and the Smirnov integration, which approximates the integral of a function on an interval using estimations.

Characteristics. These two programs only contain loops, with expensive calculations in doubly nested loops. As for the cubic spline interpolation above, more than half of the computations of each loop body can be completely evaluated or cached by specialization. In contrast with the cubic spline interpolation, the static calculations in Chebyshev and Smirnov are very expensive and allow data specialization to yield major improvements. For the combined specialization, data specialization is applied to the innermost loop and program specialization is applied to the rest of the program. For this kind of program, program and data specialization both give significant improvements. However, for the same speedup, the code produced by program specialization is a hundred times larger than the program specialized using data specialization.

Applications. The Chebyshev approximation (Appendix C) is specialized with respect to the degree n of the generated polynomial. This program contains two calls to the trigonometric function cos, one in a singly nested loop and the other in a doubly nested loop. Since this program mainly consists of data-flow computations, program specialization and data specialization obtain similar speedups (see Figure 3(C)). The Smirnov integration (Appendix D) is specialized with respect to the numbers of iterations n and m. The program contains a call to the function fabs, which returns the absolute value of its parameter; this call occurs in a doubly nested loop, and the program's running time grows with n and m. As in the case of Chebyshev, program and data specialization produce similar speedups (see Figure 3(D)).

Combining Program Specialization and Data Specialization

Finally, we analyze two programs where performance improves with program specialization when values are small and with data specialization when values are large: the FFT, which converts data from the time domain to the frequency domain, and the Romberg integration, which approximates the integral of a function on an interval using estimations.

Characteristics. These two programs contain several loops and expensive data-flow computations in doubly nested loops; however, more than half of the computations of each loop body cannot be evaluated. Beyond some number of iterations, when program specialization unrolls these loops, it increases the code size of the specialized program and thereby degrades performance: the specialized program becomes slower because of its code size. Furthermore, beyond some problem size, the specialization process cannot produce the program at all because of its size. In contrast, data specialization only caches the expensive calculations: it does not unroll loops, and it improves performance. The result is that the code produced by program specialization is a hundred times larger than the program specialized using data specialization, for a small difference in speedup. The combined specialization delays the occurrence of code explosion: data specialization is applied to the innermost loop, which contains the cache computations, and program specialization is applied to the rest of the program.

Applications. The FFT (Appendix A) is specialized with respect to the number of data points N. It contains ten loops with several degrees of nesting. One of these loops, of complexity O(N), contains four calls to trigonometric functions which can be evaluated by program specialization or cached by data specialization. Due to the elimination of these expensive library calls, program specialization and data specialization produce significant speedups (see Figure 3(A)). However, in the case of program specialization, code unrolling eventually degrades performance, whereas data specialization produces a stable speedup regardless of the number of data points. For small values of N, data specialization does not obtain a better result than program specialization; beyond some value of N, however, program specialization becomes impossible to apply because of the specialization time and the size of the residual code. In this situation, data specialization still gives better performance than the unspecialized program. Because this program also contains some conditionals, combined specialization, where the innermost loop is not unrolled, improves performance more than data specialization alone.
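To make the caching opportunity concrete, the following sketch shows how a loader could tabulate the trigonometric values consumed by one (simplified) butterfly stage; the split into fft_load and fft_read_stage, and all names, are our assumptions rather than the code of Appendix A.

    #include <math.h>
    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    /* Loader: tabulate the twiddle factors, the expensive computations
       that depend only on the early size n. */
    void fft_load(int n, double wr[], double wi[])
    {
        int i;
        for (i = 0; i < n / 2; i++) {
            wr[i] = cos(2.0 * M_PI * i / n);   /* costly calls worth caching */
            wi[i] = sin(2.0 * M_PI * i / n);
        }
    }

    /* Reader: a butterfly stage in which every trigonometric call has been
       replaced by a cache lookup; only arithmetic on the late signal remains. */
    void fft_read_stage(int n, const double wr[], const double wi[],
                        double re[], double im[])
    {
        int i, h = n / 2;
        for (i = 0; i < h; i++) {
            double tr = wr[i] * re[i + h] - wi[i] * im[i + h];
            double ti = wr[i] * im[i + h] + wi[i] * re[i + h];
            re[i + h] = re[i] - tr;  im[i + h] = im[i] - ti;
            re[i]    += tr;          im[i]    += ti;
        }
    }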
The Romberg integration (Appendix B) is specialized with respect to the number of iterations M used in the approximation. It contains two calls to the costly function intpow: one in a singly nested loop and the other in a doubly nested loop. Because both specialization strategies eliminate these expensive calls, the speedup is consequently good. As for the FFT, loop unrolling causes the program specialization speedup to decrease, whereas the data specialization speedup remains the same even when M increases (see Figure 3(B)).

5. Conclusion

We have integrated program and data specialization in a specializer named Tempo. Importantly, data specialization can reuse most of the phases of an off-line program specializer. Because Tempo now offers both program and data specialization, we have experimentally compared both strategies and their combination. This evaluation shows that, on the one hand, program specialization typically gives a better speedup than data specialization for small problem sizes; however, as the problem size increases, the residual program may become very large and often slower than the unspecialized program. On the other hand, data specialization can handle large problem sizes without much performance degradation. This strategy can, however, be ineffective if the program to be specialized mainly consists of control-flow computations. The combination of both program and data specialization is promising: it can produce a residual program more efficient than with data specialization alone, without performance dropping as dramatically as with program specialization as the problem size increases.

Acknowledgments

We thank Renaud Marlet for thoughtful comments on earlier versions of this paper, as well as the Compose group for stimulating discussions. A substantial amount of the research reported in this paper builds on work done by the authors with Scott Thibault on the Berkeley packet filter and with Julia Lawall on the fast Fourier transformation.

References

G.J. Barzdins and M.A. Bulyonkov. Mixed computation and translation: Linearisation and decomposition of compilers. Preprint, Computing Centre of the Siberian Division of the USSR Academy of Sciences, Novosibirsk, 1988.

C. Consel and O. Danvy. Tutorial notes on partial evaluation. In Conference Record of the Twentieth Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, Charleston, SC, USA, January 1993. ACM Press.

C. Consel and F. Noël. A general approach for run-time specialization and its application to C. In Conference Record of the 23rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, St. Petersburg Beach, FL, USA, January 1996. ACM Press.

B. Guenter, T.B. Knoblock, and E. Ruf. Specializing shaders. In Computer Graphics Proceedings, Annual Conference Series, 1995. ACM Press.

L. Hornof, J. Noyé, and C. Consel. Effective specialization of realistic programs via use sensitivity. In P. Van Hentenryck, editor, Proceedings of the Fourth International Symposium on Static Analysis (SAS'97), volume 1302 of Lecture Notes in Computer Science, Paris, France, September 1997. Springer-Verlag.

N.D. Jones, C. Gomard, and P. Sestoft. Partial Evaluation and Automatic Program Generation. International Series in Computer Science. Prentice-Hall, June 1993.

T.B. Knoblock and E. Ruf. Data specialization. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'96), Philadelphia, PA, May 1996. ACM SIGPLAN Notices. Also available as a Microsoft Research technical report (MSR-TR series), February 1996.

J.L. Lawall. Faster Fourier transforms via automatic program specialization. Publication interne, IRISA, Rennes, France, May 1998.

K. Malmkjær. Program and data specialization: Principles, applications, and self-application. Master's thesis, DIKU, University of Copenhagen, August 1989.

G. Muller, R. Marlet, E.N. Volanschi, C. Consel, C. Pu, and A. Goel. Fast, optimized Sun RPC using automatic program specialization. In Proceedings of the 18th International Conference on Distributed Computing Systems, Amsterdam, The Netherlands, May 1998. IEEE Computer Society Press.

G. Muller, E.N. Volanschi, and R. Marlet. Scaling up partial evaluation for optimizing the Sun commercial RPC protocol. In ACM SIGPLAN Symposium on Partial Evaluation and Semantics-Based Program Manipulation (PEPM'97), Amsterdam, The Netherlands, June 1997. ACM Press.

F. Noël, L. Hornof, C. Consel, and J. Lawall. Automatic, template-based run-time specialization: Implementation and experimental study. In International Conference on Computer Languages, Chicago, IL, May 1998. IEEE Computer Society Press. Also available as an IRISA research report (PI series).
Appendix A: Fast Fourier Transformation